High-dimensional Proximity Joins

Authors

Kyuseok Shim

Ramakrishnan Srikant

Rakesh Agrawal

Abstract

Many emerging data mining applications require a proximity (similarity) join between points in a high-dimensional domain. We present a new algorithm that uses a new data structure, called the ε-kd tree, for fast spatial proximity joins on high-dimensional points. This data structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence the proposed data structure scales to high-dimensional data. We analyze the cost of the join for the ε-kd tree and the R-tree family, and show that the ε-kd tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life datasets, shows that proximity join using the ε-kd tree is typically 2 to 40 times faster than the R-tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε-kd tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional proximity joins, but do not match the performance of the ε-kd tree.
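To make the join operation concrete, the following is a minimal sketch of what a proximity self-join computes, together with one simple way to cut down the number of candidate pairs. It is not the authors' data structure: the function names, the Euclidean distance, and the single-dimension stripe partitioning are all assumptions chosen for illustration. The only property it relies on is that two points further apart than the join distance in any one coordinate cannot be a match.

```python
from collections import defaultdict
from itertools import combinations
import math


def brute_force_join(points, eps):
    """Reference self-join: every index pair (i, j), i < j, within distance eps.

    Euclidean distance is an assumption; the abstract does not name a metric.
    """
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    }


def stripe_join(points, eps, dim=0):
    """Same result, with candidates limited to adjacent eps-wide stripes.

    Hypothetical illustration only: points are bucketed by
    floor(coordinate / eps) along one dimension `dim`.
    """
    stripes = defaultdict(list)
    for i, p in enumerate(points):
        stripes[math.floor(p[dim] / eps)].append(i)

    result = set()
    for key, members in stripes.items():
        # Points that differ by more than eps in one coordinate cannot match,
        # so only the same stripe and the next stripe need to be checked.
        neighbours = stripes.get(key + 1, [])
        candidates = list(combinations(members, 2))
        candidates += [(i, j) for i in members for j in neighbours]
        for i, j in candidates:
            if math.dist(points[i], points[j]) <= eps:
                result.add((min(i, j), max(i, j)))
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.05, 0.05), (0.5, 0.5), (0.52, 0.48), (0.9, 0.1)]
    print(sorted(stripe_join(pts, eps=0.1)))
```

Both functions return the same set of index pairs; the stripe version simply skips distance computations between points whose stripes are not adjacent. A tree-based index of the kind the abstract describes would aim to prune far more aggressively than this single-dimension filter.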

Related resources

Parallel Algorithms for High-Dimensional Proximity Joins

We consider the problem of parallelizing high-dimensional proximity joins. We present a parallel multidimensional join algorithm based on the epsilon-kdB tree and compare it with the more common approach of space partitioning. An evaluation of the algorithms on an IBM SP2 shared-nothing multiprocessor is presented using both synthetic and real-life datasets. We also examine the effectiveness ...

A Fast Algorithm for high-dimensional Similarity Joins

Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε-kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of find...

Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data

Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce mainly focused on the processing of low-dimensional vector data. In this paper, we revisit and investigate ...

Fast similarity join for multi-dimensional data

(To appear in Information Systems Journal, Elsevier, 2005.) The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memor...

Class proximity measures - Dissimilarity-based classification and display of high-dimensional data

For two-class problems, we introduce and construct mappings of high-dimensional instances into dissimilarity (distance)-based Class-Proximity Planes. The Class Proximity Projections are extensions of our earlier relative distance plane mapping, and thus provide a more general and unified approach to the simultaneous classification and visualization of many-feature datasets. The mappings display...

Publication date: 1998
